Graph-Based N-gram Language Identification on Short Texts
Authors
Abstract
Language identification (LI) is an important task in natural language processing. Several machine learning approaches have been proposed for this problem, but most of them assume relatively long, well-written texts. We propose a graph-based N-gram approach to LI, called LIGA, which targets relatively short and ill-written texts. The results of our experimental study show that LIGA outperforms the state-of-the-art N-gram approach for LI on Twitter messages.
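The abstract does not spell out how the graph is built, so the following is only a minimal sketch of one way a graph-based N-gram identifier could work, not the authors' implementation. It assumes character trigrams as nodes, edges between consecutive trigrams, and per-language counts on both. The class name `TrigramGraph`, the helper `trigrams`, and the normalised-count scoring rule are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def trigrams(text):
    """Return the character trigrams of text, padded with one space on each side."""
    padded = f" {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class TrigramGraph:
    """Hypothetical sketch: nodes are trigrams, edges link consecutive trigrams."""

    def __init__(self):
        self.nodes = defaultdict(lambda: defaultdict(int))  # trigram -> lang -> count
        self.edges = defaultdict(lambda: defaultdict(int))  # (tri, tri) -> lang -> count
        self.node_totals = defaultdict(int)                 # lang -> total node count
        self.edge_totals = defaultdict(int)                 # lang -> total edge count

    def train(self, text, lang):
        """Add one labelled text to the graph."""
        grams = trigrams(text)
        for gram in grams:
            self.nodes[gram][lang] += 1
            self.node_totals[lang] += 1
        for pair in zip(grams, grams[1:]):
            self.edges[pair][lang] += 1
            self.edge_totals[lang] += 1

    def classify(self, text):
        """Return the best-scoring language, or None if nothing matches."""
        grams = trigrams(text)
        scores = defaultdict(float)
        # Assumed scoring rule: sum of matched counts, normalised per language.
        for gram in grams:
            for lang, count in self.nodes.get(gram, {}).items():
                scores[lang] += count / self.node_totals[lang]
        for pair in zip(grams, grams[1:]):
            for lang, count in self.edges.get(pair, {}).items():
                scores[lang] += count / self.edge_totals[lang]
        return max(scores, key=scores.get) if scores else None


# Toy usage with one training sentence per language (illustrative data only).
graph = TrigramGraph()
graph.train("the weather is nice today", "en")
graph.train("het weer is vandaag mooi", "nl")
print(graph.classify("the weather"))
print(graph.classify("het weer"))
```

Scoring edges as well as nodes lets the order of trigrams contribute to the decision, which is the intuition behind using a graph rather than a flat N-gram profile; whether this matches LIGA's actual scoring is not stated in the abstract.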
Related resources
Implementation and Evaluation of a Language Identification System for Mono- and Multi-lingual Texts
Language identification is a classification task between a pre-defined model and a text in an unknown language. This paper presents the implementation of a tool for language identification for mono- and multilingual documents. The tool includes four algorithms for language identification. An evaluation for eight languages, including Ukrainian and Russian, and various text lengths is presented. It ...
Language Identification on the Web: Extending the Dictionary Method
Automated language identification of written text is a well-established research domain that has received considerable attention in the past. By now, efficient and effective algorithms based on character n-grams are in use, mainly with identification based on Markov models or on character n-gram profiles. In this paper we investigate the limitations of these approaches when applied to real-world...
Automatic identification of language varieties: The case of Portuguese
Automatic Language Identification of written texts is a well-established area of research in Computational Linguistics. State-of-the-art algorithms often rely on n-gram character models to identify the correct language of texts, with good results seen for European languages. In this paper we propose the use of a character n-gram model and a word n-gram language model for the automatic classifica...
Unsupervised Clustering for Language Identification
The current state of the art in language identification comes from n-gram language models. While these can reach 99% accuracy (Hammarstrom, 2007), they have three major shortcomings. First, n-gram language models are supervised. They require substantial labeled training data in each language in order to be functional. For best results, this training data should also be in the same genre as the ...
Entity Recognition and Language Identification with FELTS
These working notes describe the experiments we conducted in the Microblog Cultural Contextualization Lab [2] of CLEF 2017 [3]. The microblog data is composed of very short texts, with very heterogeneous styles. Some of them are written in more than one language. We decided to tackle the entity recognition problem by using a non-statistical, dictionary-based, multiword term extractor. On the othe...